Who am I

I am a PhD student in linguistics with some previous experience in R and RStudio as well as basic statistics, but with this course I hope to gain more understanding of statistical methods and open data that I could use in my research. My GitHub repository.


Regression and model validation

For the tasks this week I wrangled data into data frames in the form required for analysis. I used the data to analyse the relationship between a dependent and an independent variable by fitting a linear regression model. I can also make plots of the residuals to diagnose the appropriateness of the model.


Introduction

In this analysis we investigate how the points scored in a statistics test by a student are explained by different learning approaches and by the student’s attitude, age and gender. To do this, we shall analyse the explanatory power of the different potential explanatory variables by building a regression model with exam points as the dependent variable.

Data overview

The data for this analysis is a survey dataset listing the learning approaches in a statistics class. More information on its collection can be found here. The data has 166 observations and 7 variables. Below is a sample of what the dataset used for the purposes of this study looks like.

##   gender Age Attitude     deep  stra     surf Points
## 1      F  53       37 3.583333 3.375 2.583333     25
## 2      M  55       31 2.916667 2.750 3.166667     12
## 3      F  49       25 3.500000 3.625 2.250000     24
## 4      M  53       35 3.500000 3.125 2.250000     10
## 5      M  49       37 3.666667 3.625 2.833333     22

As can be seen, the observations include the gender and age of the respondents as well as their attitude towards statistics and their exam points. It should be noted that respondents whose score in Points equals 0 have been removed from the current 166-observation dataset. The scores for the deep, strategic (stra) and surface (surf) approaches to learning are means of the individual questionnaire items measuring each approach: for example, deep represents the mean of the scores for the following variables.

deep <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30","D06",  "D15", "D23", "D31")
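As an illustration, a minimal sketch of this kind of combination using rowMeans on a toy data frame (the values and the two-item subset here are invented for the example):

```r
# Toy stand-in for the survey data: three respondents, two of the
# deep-approach items listed above, scored on a 1-5 scale.
survey <- data.frame(D03 = c(4, 3, 5), D11 = c(3, 2, 4))

# The combined deep score is the row-wise mean of the selected items.
deep_items <- c("D03", "D11")
survey$deep <- rowMeans(survey[, deep_items])

survey$deep  # 3.5 2.5 4.5
```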

Correlation analysis

Graphical overview

The following offers an overview visualisation of the correlations within the data. It is, for example, noteworthy that there are twice as many women as men. In terms of correlations, the strongest, at 0.437, is between Points and Attitude, and it is noticeable among both genders. The lowest (absolute) correlation is between Attitude and Age. This suggests that attitude towards statistics learning does not depend on the age of the respondent, but attitude does play an important role in the points scored. Finally, although low, two more correlations worth noting are the 0.146 correlation between Points and the strategic approach, and the -0.144 correlation between Points and the surface approach.

In addition, the correlation between the surface and deep approaches, although minimal among women, is -0.622 among men, suggesting that men who use the surface approach are less likely to use the deep approach and vice versa. As seen in the figure, there is also some negative correlation between the surface approach and Attitude, again more pronounced among men, as well as between the surface and strategic approaches.
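An overview plot of this kind can be produced with the GGally package; a sketch, assuming the analysis data is in a data frame named dt (the name used in the model output below):

```r
# Scatterplot matrix with distributions and pairwise correlations,
# split by gender. Assumes the wrangled data frame dt from above.
library(GGally)
library(ggplot2)

ggpairs(dt,
        mapping = aes(col = gender, alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)))
```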

Regression model

Based on the overview above, we shall first investigate a linear model that assumes Points is explained by the variables Attitude, surf and stra. The results of this model are given below:

## 
## Call:
## lm(formula = Points ~ Attitude + surf + stra, data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## surf        -0.58607    0.80138  -0.731  0.46563    
## stra         0.85313    0.54159   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

According to the results above, the only variable with a p-value below 0.001 is Attitude; the p-values of the approach variables are above 0.05, meaning that for those variables we fail to reject the null hypothesis. Thus, below we shall fit a new model with only the statistically significant variable Attitude.

Relationship between explanatory and target variable

The results above suggested that only Attitude can be used to explain the dependent variable Points. In other words, our model is:

Points = α + βAttitude + ε

where

  • α is the intercept, i.e. the predicted Points when Attitude = 0

  • β is the regression coefficient for Attitude

  • ε is the error term, an unobserved random variable assumed to be normally distributed

The results and a plot illustrating this model are given below:

## 
## Call:
## lm(formula = Points ~ Attitude, data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The line is positioned so as to minimise the residuals (differences between observed and predicted values), but quite a few observations fall far from the line. Indeed, the multiple R-squared is 0.1906, indicating that only about 19 % of the variance in Points is explained by the variable Attitude. However, the p-value <0.001 gives reason to believe the variable is nonetheless statistically significant.
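A plot of this kind can be drawn, for example, as follows (assuming the data frame dt from above):

```r
# Scatterplot of Points against Attitude with the fitted least-squares line.
library(ggplot2)

ggplot(dt, aes(x = Attitude, y = Points)) +
  geom_point() +
  geom_smooth(method = "lm")  # regression line with a confidence band
```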

Diagnostic plots

The current model assumes that Points can be explained by a single explanatory variable, Attitude, and that a linear regression model (as opposed to e.g. a polynomial regression model) is sufficient for explaining and predicting the observations. To diagnose the model, we shall plot the residuals. Plotting the residuals against the fitted values, we can see that the model is reasonable.

In the Residuals vs Fitted and Residuals vs Leverage plots, the observations are fairly evenly distributed and there is no visible pattern that would suggest the need for a polynomial regression model. In the Normal Q-Q plot, most of the observations line up on the same line, with some deviation only among the lowest and highest values. This indicates that the model is adequate for explaining our dependent variable and that we do not need to resort to a more complex model.
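The diagnostic plots discussed above can be produced directly from the fitted model object; a sketch, assuming the data frame dt and storing the model as my_model:

```r
# Fit the simple regression and draw three diagnostic plots:
# Residuals vs Fitted (1), Normal Q-Q (2), Residuals vs Leverage (5).
my_model <- lm(Points ~ Attitude, data = dt)
par(mfrow = c(2, 2))
plot(my_model, which = c(1, 2, 5))
```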


Logistic regression

In this section I use a logistic regression model to study what variables impact alcohol consumption.

Data overview

The data for the present study is a dataset on alcohol consumption among secondary school students in two schools in Portugal. More information (and download) is available here. The variables include e.g. the student’s background (family size, education and employment of parents), behaviour (e.g. amount of free time, health status) as well as performance in the subjects of Maths and Portuguese. In the dataset of this study, the original data has already been modified so that the variables for alcohol consumption on weekdays and on weekends have been combined into a single variable alc_use, which is the mean of the two. Further, a binary column high_use has been created based on alc_use, where high_use = TRUE signifies a mean alcohol consumption greater than 2.
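The combination step can be sketched in base R on a toy data frame (the values are invented; Dalc and Walc are the original weekday and weekend consumption columns):

```r
# Toy stand-in with the two original consumption columns (1-5 scale).
alc <- data.frame(Dalc = c(1, 3), Walc = c(2, 5))

alc$alc_use  <- (alc$Dalc + alc$Walc) / 2  # mean of weekday and weekend use
alc$high_use <- alc$alc_use > 2            # TRUE when the mean exceeds 2

alc$high_use  # FALSE TRUE
```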

A list of all variables is shown below:

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Several of the variables are numeric, using a Likert-scale type 1-5 range (very low - very high), including the values for alcohol consumption. Other variables are binary yes/no data, such as whether the student is in a romantic relationship.

Variables impacting alcohol consumption

Hypotheses

The purpose of this study is to explore how variables predict high/low alcohol consumption among the observations. I hypothesise that alcohol consumption is related to the following four variables:

  • Sex: it is expected that male students have a higher consumption rate than female students.
  • Study time: students who study longer per week are expected to have a lower alcohol consumption (negative correlation).
  • Going out with friends: students who go out will have a higher alcohol consumption rate (positive correlation).
  • Absences: students with many absences from school are expected to have higher alcohol consumption (positive correlation).

The hypotheses above reflect only my current intuitions and are not based on research literature on alcohol consumption or on previous familiarisation with the data. It should also be mentioned that the purpose of this analysis is to study these relationships; even a strong relationship is not necessarily a sign of causality, let alone of which variable is the cause and which the effect.

Distribution of chosen variables

1. Sex

The first hypothesis is that male students have higher alcohol consumption than female students. It should be noted that the number of females (198) is roughly the same as that of males (184). Below, the counts of high-consumption students are given as proportions within each sex.

The figure supports our hypothesis: male students have a higher percentage of high alcohol consumption, i.e. high consumption rates are more common among men than among women.
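A proportion table of this kind can be computed with prop.table; a sketch, assuming the data frame alc described above:

```r
# Share of high/low consumption within each sex (rows sum to 1).
prop.table(table(alc$sex, alc$high_use), margin = 1)
```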

2. Study time

Our second hypothesis is that students who study more are less likely to have a high alcohol consumption rate (negative correlation). Study time is measured on the following 4-point scale, representing study hours per week: 1 = <2 hours, 2 = 2 to 5 hours, 3 = 5 to 10 hours, 4 = >10 hours. In the figure below, the findings are again split across sexes. A boxplot was chosen because, in spite of the values being numbers, Likert-type data is better thought of as ordinal rather than continuous, and averages produced from it are misleading at best.

The graph indicates that high consumption is indeed more common among students with lower weekly totals of study time. Interestingly, male students have a break in the pattern, with students who study 5-10 hours per week having a noticeably lower number of high consumers than all other groups apart from females studying more than 10 hours. However, our overall hypothesis is still supported.

3. Going out with friends

Here it is hypothesised that students who go out are more likely to have high consumption rates. The frequency of going out with friends is here measured using a numeric 5-step Likert-scale, where 1 = very low, and 5 = very high. Again, we examine this first in boxplots.

It appears that high consumption is more common among students who go out frequently, but this is especially pronounced among male students, whereas the female students who go out the most have a lower number of high consumers than the second-highest female group. Thus, our hypothesis is applicable mainly for male students.

4. Absences

Another assumption is that students with high alcohol consumption are also more likely to have more absences from school. Absences is a count of the number of school absences. We shall explore this numerical variable in a boxplot, where the dotted line indicates the mean.
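A boxplot of this kind can be sketched with ggplot2 (assuming the alc data frame; here stat_summary marks the group mean with a point, playing the role of the mean line mentioned above):

```r
# Absences by consumption group, split by sex, with the group mean marked.
library(ggplot2)

ggplot(alc, aes(x = high_use, y = absences)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4) +
  facet_wrap(~ sex)
```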

The means indicate a difference, but this may be because of outliers. Not counting the outliers marked as dots beyond the whiskers, the boxplots above suggest that there may be a relationship between absences and alcohol among male students, but among female students the differences are quite small, the medians being about even and the main differences being visible in the top 50%. Based only on this, we cannot outright claim that this supports our hypothesis.

Fitting a logistic regression model

In this section we shall fit a logistic regression model that could explain and predict high alcohol use among the population. We shall at least not yet remove any of the variables, and our model and its statistical summary is thus as follows.

high_use ~ sex + studytime + goout + absences

## 
## Call:
## glm(formula = high_use ~ sex + studytime + goout + absences, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9103  -0.7892  -0.5021   0.7574   2.6021  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.20483    0.59429  -5.393 6.94e-08 ***
## sexM         0.78173    0.26555   2.944 0.003242 ** 
## studytime   -0.42116    0.17058  -2.469 0.013551 *  
## goout        0.72677    0.11994   6.059 1.37e-09 ***
## absences     0.07811    0.02244   3.481 0.000499 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 381.31  on 377  degrees of freedom
## AIC: 391.31
## 
## Number of Fisher Scoring iterations: 4

Based on the above, our highly statistically significant (p < 0.001) variables are goout and absences, with sex having a statistically significant p-value of < 0.01 and studytime one of < 0.05. In other words, we reject the null hypothesis for each of our variables.

The coefficients of the variables are as follows:

## (Intercept)        sexM   studytime       goout    absences 
## -3.20483157  0.78173290 -0.42116184  0.72676824  0.07810746

As hypothesised, study time is the only explanatory variable with a negative coefficient. Absences has the lowest absolute coefficient. We shall take the exponents of the coefficients to analyse the odds ratios of the variables as well as their confidence intervals.

Odds are calculated as p/(1-p), where p is the probability of the event. The odds ratio is the ratio of the odds of two values, quantifying their relationship; it also represents the multiplicative change in the odds when the variable increases by one unit. The confidence interval describes the precision of the odds ratio estimate: the 2.5 % and 97.5 % quantiles reported below bound a 95 % confidence interval, i.e. a range that covers the true population odds ratio with 95 % confidence.
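The computation can be sketched as follows, assuming the fitted model object from the summary above is stored as m:

```r
# m <- glm(high_use ~ sex + studytime + goout + absences,
#          data = alc, family = "binomial")

OR <- exp(coef(m))     # exponentiated coefficients = odds ratios
CI <- exp(confint(m))  # exponentiated profile-likelihood confidence bounds
cbind(OddsRatio = OR, CI)
```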

##              OddsRatio      2.5 %    97.5 %
## (Intercept) 0.04056573 0.01222314 0.1263579
## sexM        2.18525582 1.30370562 3.7010763
## studytime   0.65628388 0.46510383 0.9099505
## goout       2.06838526 1.64470314 2.6349716
## absences    1.08123884 1.03577383 1.1324143

The odds ratios show the multiplicative change in the odds of high alcohol consumption for every one-unit increase in study time, going out or absences. For sex, the table indicates that the odds of a male student having high alcohol consumption are about 2.19 times those of a female student. Ranked from the largest to the smallest change in odds per unit, sex is at the top and study time at the bottom. However, the confidence intervals are fairly wide, meaning there is considerable uncertainty about the exact odds ratios in the population.

Predictive accuracy

Finally, we shall evaluate the prediction accuracy of our logistic regression model. That is, if the model were used to predict each observation, with what frequency would it correctly categorise an observation as having either a high or a low alcohol consumption rate. The accuracy of our model is as follows (in numbers of observations and in proportions):

##         prediction
## high_use FALSE TRUE
##    FALSE   252   16
##    TRUE     65   49
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65968586 0.04188482 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.82984293 0.17015707 1.00000000

The so-called confusion matrix above (with rounded proportions) indicates that 66.0 % (N = 252) of the observations have a low consumption rate and are predicted as such by the model. Correspondingly, 12.8 % (N = 49) of the observations have high consumption and are accurately predicted as such. The rest were predicted incorrectly (high use predicted as low, or vice versa).
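The matrix and the error rate can be computed from the predicted probabilities; a sketch, assuming the fitted model m and the data frame alc from above:

```r
# Classify each student at the 0.5 probability threshold.
alc$probability <- predict(m, type = "response")
alc$prediction  <- alc$probability > 0.5

# Confusion matrix (counts) and the share of incorrect predictions.
table(high_use = alc$high_use, prediction = alc$prediction)
mean(alc$high_use != alc$prediction)
```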

In short, approx. 78.8 % of the observations were predicted correctly and 21.2 % (0.2120419) incorrectly. Visualised, the data appears as follows:

Supposing that we had not fitted a logistic regression model but simply guessed, the number of correct guesses would likely be lower than the number predicted correctly by the model. If we assumed none of the students to have a high alcohol consumption rate (probability = 0), our error rate would be the following:

## [1] 0.2984293

We shall also test the predictive power of a 50/50 guessing method where the probability for a student to have a high alcohol consumption rate is assumed to be 0.5 (i.e. the rate is as likely to be high as to be low). We calculate the mean of incorrect guesses as:

## [1] 0

Interestingly, our calculation indicates that the mean of incorrect guesses is 0, i.e. that every guess was counted as correct. This is suspicious: random 50/50 guessing should be wrong about half the time, so the result more likely reflects how the incorrect guesses are counted than genuine predictive power.
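One plausible explanation, assuming the loss function commonly used in the course exercises (the share of cases where the guess is on the wrong side of the true class by more than 0.5): with a constant guess of exactly 0.5, the absolute difference |class − prob| is exactly 0.5 for every observation, so the strict inequality never triggers and no guess is counted as wrong.

```r
# Share of guesses that differ from the true class by more than 0.5.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

class <- c(0, 0, 1, 1)        # toy true classes (0 = low use, 1 = high use)
loss_func(class, prob = 0)    # wrong for every high-use case -> 0.5
loss_func(class, prob = 0.5)  # |class - 0.5| is never > 0.5  -> 0
```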

Bonus

10-fold validation

Lastly, we shall use so-called 10-fold cross-validation to assess the accuracy of the model fitted above. The main goal of cross-validation is to validate the model on data that has not been used for training it. In 10-fold cross-validation, the data is split into ten parts, of which nine are used for training and the remaining one for testing; the process is repeated ten times, with a different part serving as the test set each time. The mean prediction error is as follows:

## [1] 0.2172775

In summary, the prediction error of the model on test data (0.2172775) is very close to its training error (0.2120419). The prediction error is also lower than that of a similar model tested in the DataCamp exercises leading up to this analysis, where the explanatory variables were sex, absences and failures in classes.
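The cross-validation can be run with boot::cv.glm; a sketch, assuming the model m, the data frame alc, and the classification loss described earlier:

```r
library(boot)

# Share of predictions on the wrong side of the 0.5 threshold.
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# 10-fold cross-validation of the fitted logistic model m.
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]  # average prediction error over the folds
```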


Clustering and classification

In this analysis, we apply clustering and classification methods to study a dataset. Based on classes learned from training data, we are also able to classify observations in unseen test data.

Introduction


Data description

The data we use for this analysis comes from the R package “MASS” and represents housing values in Boston in the 1970s. A quick overview of the variables is provided below, where the variables stand for the following (Source):

  • crim: per capita crime rate by town.

  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus: proportion of non-retail business acres per town.

  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • nox: nitrogen oxides concentration (parts per 10 million).

  • rm: average number of rooms per dwelling.

  • age: proportion of owner-occupied units built prior to 1940.

  • dis: weighted mean of distances to five Boston employment centres.

  • rad: index of accessibility to radial highways.

  • tax: full-value property-tax rate per $10,000.

  • ptratio: pupil-teacher ratio by town.

  • black: 1000(Bk−0.63)^2 where Bk is the proportion of blacks by town.

  • lstat: lower status of the population (percent).

  • medv: median value of owner-occupied homes in $1000s.

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The variable rad and the dummy variable chas are the only integer variables, the remaining being numerical.

Graphical overview

Scaling the data

Before we begin our analysis, we shall standardise the dataset. This means that we assume a normal distribution and adjust the numeric values of the variables so that each has mean 0, with all values indicating distance from the mean in units of standard deviation. The data now looks as follows (cf. the table above):

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Before continuing, we shall convert the variable crim into a categorical factor variable based on its quantiles, seen above as the values for Min, 1st Qu., Median, 3rd Qu. and Max. The variable (renamed crime) now has the following levels and numbers of observations per level:

## 
##      low  med_low med_high     high 
##      127      126      126      127

We shall also divide the dataset into a training and a testing set. The training set will consist of a random sample of 80 % of the whole dataset, the testing set of the remaining 20 %.

##        zn               indus               chas         
##  Min.   :-0.48724   Min.   :-1.51549   Min.   :-0.27233  
##  1st Qu.:-0.48724   1st Qu.:-0.91166   1st Qu.:-0.27233  
##  Median :-0.48724   Median :-0.21089   Median :-0.27233  
##  Mean   : 0.06743   Mean   :-0.01142   Mean   : 0.03646  
##  3rd Qu.:-0.08527   3rd Qu.: 1.01499   3rd Qu.:-0.27233  
##  Max.   : 3.58609   Max.   : 2.42017   Max.   : 3.66477  
##       nox                 rm                age          
##  Min.   :-1.42991   Min.   :-1.96641   Min.   :-2.08800  
##  1st Qu.:-0.86682   1st Qu.:-0.57839   1st Qu.:-0.85705  
##  Median :-0.23037   Median :-0.22578   Median : 0.42364  
##  Mean   : 0.06059   Mean   : 0.03899   Mean   : 0.02583  
##  3rd Qu.: 0.93249   3rd Qu.: 0.49723   3rd Qu.: 0.85972  
##  Max.   : 2.72965   Max.   : 3.47325   Max.   : 1.11639  
##       dis                  rad                tax          
##  Min.   :-1.2658165   Min.   :-0.98187   Min.   :-1.30676  
##  1st Qu.:-0.8095571   1st Qu.:-0.63733   1st Qu.:-0.75940  
##  Median :-0.4617172   Median :-0.52248   Median :-0.37521  
##  Mean   : 0.0004225   Mean   : 0.05625   Mean   : 0.02809  
##  3rd Qu.: 0.7440043   3rd Qu.: 1.65960   3rd Qu.: 1.52941  
##  Max.   : 2.5764502   Max.   : 1.65960   Max.   : 1.79642  
##     ptratio             black             lstat              medv         
##  Min.   :-2.70470   Min.   :-3.8792   Min.   :-1.4260   Min.   :-1.90634  
##  1st Qu.:-0.46446   1st Qu.: 0.2353   1st Qu.:-0.7636   1st Qu.:-0.54178  
##  Median : 0.25149   Median : 0.3883   Median :-0.1111   Median :-0.11773  
##  Mean   : 0.01058   Mean   : 0.1411   Mean   : 0.0495   Mean   : 0.09205  
##  3rd Qu.: 0.80578   3rd Qu.: 0.4406   3rd Qu.: 0.5986   3rd Qu.: 0.27641  
##  Max.   : 1.26768   Max.   : 0.4406   Max.   : 2.5426   Max.   : 2.98650  
##       crime   
##  low     :28  
##  med_low :24  
##  med_high:21  
##  high    :29  
##               
## 
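The wrangling steps above (standardising, cutting crim into a factor, and splitting the data) can be sketched as follows; the object names are illustrative:

```r
library(MASS)

# Standardise the Boston data and turn the resulting matrix
# back into a data frame.
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Replace crim with a categorical crime variable cut at its quantiles.
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

# 80/20 split into training and testing sets.
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
```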

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a classification method for continuous, normally distributed variables, such as those in our scaled dataset. Using crime as our target variable, we shall apply LDA to the training set so as to later test it on the testing set.

## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2450495 0.2524752 0.2599010 0.2425743 
## 
## Group means:
##                   zn      indus         chas        nox          rm
## low       0.86772064 -0.9117255 -0.113254311 -0.8559192  0.43722815
## med_low  -0.05090227 -0.3336183 -0.002135914 -0.6209215 -0.09659554
## med_high -0.37943482  0.2454243  0.177625245  0.3913831  0.05252999
## high     -0.48724019  1.0171960 -0.111631099  1.0285199 -0.43801498
##                 age        dis        rad        tax    ptratio
## low      -0.8761555  0.8224485 -0.6825737 -0.7404465 -0.4516307
## med_low  -0.4178903  0.4167778 -0.5495071 -0.4866089 -0.1076168
## med_high  0.4423203 -0.3907026 -0.4054501 -0.2690322 -0.2077740
## high      0.8192450 -0.8464597  1.6373367  1.5134896  0.7798552
##                black        lstat        medv
## low       0.38004847 -0.769773881  0.51833638
## med_low   0.30284640 -0.166041183  0.01295565
## med_high  0.08102127 -0.005963686  0.10733745
## high     -0.93285252  0.905313017 -0.74792289
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.11204850  0.486900816 -1.11026555
## indus    0.02430163 -0.360159169  0.29034052
## chas    -0.11344102 -0.006334850  0.20103049
## nox      0.44918572 -0.898793742 -1.24339451
## rm      -0.06031348 -0.056660379 -0.10599577
## age      0.25577131 -0.387833245 -0.16624840
## dis     -0.10488040 -0.225491698  0.31323980
## rad      2.95398512  0.932620317 -0.20078859
## tax     -0.02847357  0.127736250  0.61568770
## ptratio  0.12858149 -0.091210558 -0.34358310
## black   -0.13269809 -0.004660517  0.07358266
## lstat    0.17834616 -0.057512725  0.51580490
## medv     0.16039606 -0.377912300 -0.15299087
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9418 0.0428 0.0154

Based on the above, the first linear discriminant (LD1) explains as much as 94.2 % of the between-group variance, with LD2 explaining 4.3 % and LD3 1.5 %. Visualising the observations across LD1 and LD2, the data looks as follows:

Predicting classes in test data

Based on the classification data of the LDA above, we shall classify the observations of the test data, i.e. we are testing the predictive accuracy of the LDA on data not part of the model’s training data. The results of the prediction are as follows:

##           predicted
## correct    low med_low med_high high
##   low       18       7        3    0
##   med_low    0      14       10    0
##   med_high   0       7       14    0
##   high       0       0        0   29

Out of a total of 102 observations, the number of incorrectly classified observations is 27 (= 7 + 3 + 10 + 7), i.e. 26.5 %.
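The fit-and-predict steps can be sketched as follows, using the train and test sets constructed earlier (the object names are illustrative):

```r
library(MASS)

# Fit the LDA with crime as the target and all other variables as predictors.
lda.fit <- lda(crime ~ ., data = train)

# Set aside the correct classes, then classify the test observations.
correct_classes <- test$crime
test$crime <- NULL
lda.pred <- predict(lda.fit, newdata = test)

# Cross-tabulate correct against predicted classes.
table(correct = correct_classes, predicted = lda.pred$class)
```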

Distances

Here we shall investigate the distances between observations. To do this, we begin with a fresh dataset, i.e. the original Boston dataset prior to the modifications described above. We standardise the dataset and compute distance matrices:

data('Boston')
boston_scaled <- scale(Boston)
## [1] "Euclidean distance"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.620 170.500 226.300 371.900 626.000
## Warning in dist(boston_scaled, method = "manhattan"): NAs introduced by
## coercion
## [1] "Manhattan distance"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2849  8.8390 13.0300 13.8700 18.1000 46.8900
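The distance computation can be completed as follows (scale() returns a numeric matrix, which dist() accepts directly):

```r
library(MASS)

data("Boston")
boston_scaled <- scale(Boston)

dist_eu  <- dist(boston_scaled)                       # Euclidean (default)
dist_man <- dist(boston_scaled, method = "manhattan")

summary(as.vector(dist_eu))
```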

Let us visualise the k-means clustering of the dataset. Here the optimal number of clusters is determined by looking at the total within-cluster sum of squares (WCSS), namely where the line in a line plot drops most rapidly.

Based on the above, we select 2 as the optimal number of clusters. Visualised, the clusters appear as follows:
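A sketch of the procedure (set.seed makes the random initialisation reproducible; the variable names are illustrative):

```r
library(MASS)

data("Boston")
boston_scaled <- scale(Boston)
set.seed(123)

# Total within-cluster sum of squares for k = 1..10 clusters.
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b")  # look for the sharp drop ("elbow")

# Fit the chosen two-cluster solution and colour a few variable pairs by cluster.
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[, 1:5], col = km$cluster)
```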

Super-bonus

We shall create a 3D plot of the data across the three linear discriminants, setting the colour to represent the crime quantiles.
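A sketch of the plot, assuming the observations are projected into the discriminant space via the matrix product of the training predictors and the LDA scaling matrix (the dimensions printed below, 404 × 13 and 13 × 3, are consistent with this); the object names `train` and `lda.fit` are assumptions:

```r
library(plotly)
# project training observations onto the three linear discriminants
model_predictors <- dplyr::select(train, -crime)
matrix_product <- as.data.frame(as.matrix(model_predictors) %*% lda.fit$scaling)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = "scatter3d", mode = "markers", color = train$crime)
```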

## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## [1] 404  13
## [1] 13  3

Looking at the same model with color signifying the clusters:

## [1] 404  13
## [1] 13  3
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Here low and med_low are merged into cluster 1, whereas high and med_high make up the second cluster. A 3D view illustrates the heterogeneity of the clusters: cluster 1 has a limited range of y-values (LD2) but a wide range of x-values (LD1), whereas for cluster 2 the opposite is true. Both clusters span the z-axis (LD3). However, it should be noted that LD3 had the smallest proportion of trace, and that of LD2 was also quite low. In light of this, the plotting seems appropriate enough.


Dimensionality reduction techniques

Because analysing and visualising tens of dimensions is much more difficult, it is often useful to reduce the number of dimensions by merging correlated variables. In this analysis we conduct a Principal Component Analysis (PCA) and a Multiple Correspondence Analysis (MCA).

Data overview

The data we use here was created as part of the UN’s Human Development Programme, reflecting the Human Development Index (HDI), Gross National Income, and variables such as expected years of education or female labour force participation in different countries. For an overview, the variables are:

## 'data.frame':    155 obs. of  8 variables:
##  $ eduratio: num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labratio: num  0.891 0.819 0.825 0.884 0.829 ...
##  $ exped   : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ lifexp  : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ gni     : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ matmor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ adbirth : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ parlperc: num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

i.e.

  • eduratio: the ratio of the percentage of females vs. males with at least secondary education
  • labratio: the ratio of the percentages of females vs. males in the labour force
  • exped: expected years of education
  • lifexp: life expectancy at birth
  • gni: Gross National Income per capita
  • matmor: maternal mortality ratio
  • adbirth: adolescent birth rate
  • parlperc: percentage of female representatives in parliament

Note that any countries with missing values for any variables have been deleted from the present data.

Principal Component Analysis with Non-Standardized Data

First we shall conduct a Principal Component Analysis (PCA) using the same data as above with non-standardized variables.

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped


As seen in the biplot visualisation above, GNI clearly stands out from the other variables. However, this is because GNI is numerically on a completely different scale from the other variables: tens of thousands versus ratios below 1 or below 100 %. Since an unscaled PCA is dominated by the variable with the largest variance, the result is unreliable (only outliers show), not to mention difficult to analyse. The data must thus be standardised.

Principal Component Analysis with Standardized Data

To remedy the abovementioned problem, we conduct a PCA on a standardized version of the data, where each variable has mean 0 and standard deviation 1. The results are visualised using PC1 and PC2, along with the percentages of explained variance for both principal components. Together the two components explain 69.8 % of the variance.

##     eduratio          labratio           exped             lifexp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       gni              matmor           adbirth           parlperc      
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
## [1] "PC1 (53.6%)" "PC2 (16.2%)" "PC3 (9.6%)"  "PC4 (7.6%)"  "PC5 (5.5%)" 
## [6] "PC6 (3.6%)"  "PC7 (2.6%)"  "PC8 (1.3%)"

From this biplot we can see that expected years of education correlates - as expected - with the ratio of females vs. males with at least secondary education, as well as with GNI and life expectancy at birth. Further, these four variables correlate negatively with maternal mortality and the adolescent birth rate. We also see that the ratio of females vs. males in the labour force correlates with the percentage of female representatives in parliament.

Interpretations

I interpret the two principal components identified above as follows:

  • PC1: life and welfare
  • PC2: representation

Along PC1 (53.6 % of the variance) we have variables that represent life and mortality, but also education and GNI. Meanwhile, PC2 (16.2 % of the variance) represents the two variables tied to the representation of females in working life and politics. Female representation alone does not guarantee low maternal mortality or a low adolescent birth rate (cf. Mozambique in the biplot above), but high education levels or GNI likewise do not guarantee equal representation (cf. Iran or Qatar).

Multiple Correspondence Analysis

Here we shall switch datasets and use data from the FactoMineR package describing tea consumption and how tea is usually enjoyed. We shall analyse the data through Multiple Correspondence Analysis (MCA). The dataset has 36 variables in total, but here we are interested in who the respondents are and how they drink their tea. Our variables (matching the output below) are: “SPC”, “frequency”, “how”, “How” and “sex”. The results appear as follows:

## Warning: attributes are not identical across measure variables; they will
## be dropped

## 
## Call:
## MCA(X = tea2) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.293   0.263   0.248   0.243   0.233   0.221
## % of var.              9.772   8.766   8.268   8.085   7.774   7.357
## Cumulative % of var.   9.772  18.537  26.806  34.891  42.665  50.021
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.208   0.197   0.191   0.183   0.166   0.159
## % of var.              6.936   6.553   6.382   6.103   5.517   5.306
## Cumulative % of var.  56.957  63.509  69.891  75.994  81.511  86.817
##                       Dim.13  Dim.14  Dim.15
## Variance               0.139   0.129   0.128
## % of var.              4.618   4.312   4.253
## Cumulative % of var.  91.435  95.747 100.000
## 
## Individuals (the 10 first)
##                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1            |  0.683  0.530  0.204 |  0.325  0.133  0.046 | -0.369  0.183
## 2            |  0.480  0.262  0.083 | -0.147  0.027  0.008 | -0.360  0.174
## 3            | -0.240  0.065  0.017 | -0.551  0.385  0.087 | -0.611  0.501
## 4            |  0.231  0.061  0.033 | -0.259  0.085  0.041 | -0.370  0.184
## 5            | -0.033  0.001  0.001 | -0.047  0.003  0.001 |  0.060  0.005
## 6            |  0.231  0.061  0.033 | -0.259  0.085  0.041 | -0.370  0.184
## 7            |  0.665  0.502  0.122 |  0.526  0.351  0.076 |  0.502  0.339
## 8            |  0.381  0.165  0.041 |  0.080  0.008  0.002 | -0.739  0.733
## 9            |  0.712  0.576  0.155 | -0.159  0.032  0.008 |  1.220  2.002
## 10           |  0.359  0.147  0.049 |  0.183  0.042  0.013 |  1.131  1.720
##                cos2  
## 1             0.060 |
## 2             0.047 |
## 3             0.107 |
## 4             0.084 |
## 5             0.002 |
## 6             0.084 |
## 7             0.069 |
## 8             0.156 |
## 9             0.456 |
## 10            0.488 |
## 
## Categories (the 10 first)
##                 Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test  
## employee     | -0.325  1.421  0.026 -2.785 | -0.196  0.572  0.009 -1.673 |
## middle       |  0.472  2.029  0.034  3.203 |  1.314 17.501  0.266  8.910 |
## non-worker   | -0.254  0.942  0.018 -2.291 | -0.038  0.023  0.000 -0.341 |
## other worker |  0.618  1.740  0.027  2.858 | -1.156  6.772  0.095 -5.341 |
## senior       |  1.211 11.680  0.194  7.612 |  0.080  0.057  0.001  0.506 |
## student      | -0.751  8.985  0.172 -7.167 | -0.182  0.585  0.010 -1.731 |
## workman      |  1.201  3.939  0.060  4.240 | -0.465  0.659  0.009 -1.643 |
## 1/day        |  0.611  8.075  0.173  7.196 | -0.520  6.515  0.125 -6.122 |
## 1 to 2/week  |  0.344  1.184  0.020  2.466 |  0.064  0.045  0.001  0.457 |
## +2/day       | -0.529  8.080  0.205 -7.837 |  0.038  0.046  0.001  0.558 |
##               Dim.3    ctr   cos2 v.test  
## employee      0.375  2.228  0.034  3.207 |
## middle       -0.369  1.466  0.021 -2.504 |
## non-worker   -0.188  0.609  0.010 -1.694 |
## other worker -1.094  6.428  0.085 -5.054 |
## senior        1.486 20.786  0.292  9.341 |
## student      -0.374  2.626  0.042 -3.564 |
## workman       0.058  0.011  0.000  0.204 |
## 1/day        -0.034  0.029  0.001 -0.395 |
## 1 to 2/week  -0.976 11.275  0.164 -7.000 |
## +2/day        0.289  2.845  0.061  4.278 |
## 
## Categorical variables (eta2)
##                Dim.1 Dim.2 Dim.3  
## SPC          | 0.451 0.344 0.424 |
## frequency    | 0.258 0.258 0.184 |
## how          | 0.063 0.547 0.521 |
## How          | 0.149 0.139 0.102 |
## sex          | 0.546 0.027 0.010 |

We need as many as 6 dimensions to explain even 50 % of the variance. However, it appears, for example, that the middle class favours unpackaged (loose-leaf) tea more than the other social groups. Further, they do not drink tea with milk as much as the working class does. (It is known in sociology that the middle class tends to distinguish itself from the lower classes while its practices also differ from those of the upper classes.) One concern about the data is that student and female appear to correlate strongly, whereas senior correlates with male. This could indicate a sampling where one sex is overrepresented in a group. However, investigating the data further falls outside the scope of the current analysis.
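The MCA above can be reproduced roughly as follows (a sketch using the FactoMineR tea data; the subset object name `tea2` appears in the MCA call above):

```r
library(FactoMineR)
data(tea)

# keep the variables analysed above
tea2 <- tea[, c("SPC", "frequency", "how", "How", "sex")]

mca <- MCA(tea2, graph = FALSE)
summary(mca)

# plot the variable categories only, coloured by variable
plot(mca, invisible = c("ind"), habillage = "quali")
```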


Analysis of longitudinal data

This week we apply the methods learned so far to longitudinal data, i.e. data collected over a longer time period, with several observations per subject.

Case 1: RATS!

Overview

The data used in this exercise, based on data by Kimmo Vehkalahti found here, comes from a nutrition study on three groups of rats, each rat’s weight being recorded approximately once a week for nine weeks. That is, it is longitudinal data tracking individuals over a time period. The goal is to compare the weight trends of the three groups. The data has been wrangled into long (as opposed to wide) format.

Simply put, the variables are as follows:

##   ID Group  WD Weight Time
## 1  1     1 WD1    240    1
## 2  2     1 WD1    225    1
## 3  3     1 WD1    245    1
## 4  4     1 WD1    260    1
## 5  5     1 WD1    255    1
## 6  6     1 WD1    260    1

For a graphical overview, these are the trends over the 9-week time period across the three groups:

The rats in the first group weigh noticeably less than those in the other groups, whereas the rats in group 2 seem, by the end of the experiment, to weigh close to what the rats in group 3 weighed at the beginning. Group 2 also includes what seems to be an outlier. With standardised weights, the data appears as follows, allowing for closer tracking of individuals:

Summary measure

To see the forest for the trees - or the groups for the individuals - it is helpful to plot the data in a way that shows the groups, not individuals.

We shall now plot the summary measures, i.e. single values that represent all the observations of an individual. For this we use only the observations after the first measurement (Day 1).
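A sketch of the summary-measure step (assuming the long-form data frame is named `RATSL`; the summary object name `RATSS` is also an assumption):

```r
library(dplyr)
library(ggplot2)

# summary measure: mean weight per rat over all measurements after day 1
RATSS <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise(mean = mean(Weight)) %>%
  ungroup()

ggplot(RATSS, aes(x = Group, y = mean)) +
  geom_boxplot() +
  scale_y_continuous(name = "mean(Weight)")
```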

There are outliers in all groups, so to reduce bias we remove them: the highest value from Group 2 and the lowest values from Groups 1 and 3. We repeat the plot without these outliers:

Now the groups appear more clearly different. Group 1 is noticeably lighter than the two other groups, and the weights of the rats in Group 3 are very close to each other. We shall verify the difference between Groups 2 and 3 using Student’s t-test.
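The test can be run as follows (a sketch; `RATSS2` is an assumed name for the outlier-free summary data restricted to Groups 2 and 3, and `var.equal = TRUE` gives the classical Student's test rather than the Welch variant):

```r
# Student's two-sample t-test of the summary measure between Groups 2 and 3
t.test(mean ~ Group, data = RATSS2, var.equal = TRUE)
```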

## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -18.235, df = 4, p-value = 5.32e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -98.94088 -72.79246
## sample estimates:
## mean in group 2 mean in group 3 
##        452.4000        538.2667

The p-value of 5.32e-05 is well below 0.05, indicating that these groups indeed differ from each other in a statistically significant way.

Case 2: A Brief Psychiatric Rating Scale (BPRS)

Overview

BPRS is a measure of behaviour and psychiatric symptoms (Wikipedia). In this dataset from here, male subjects from two treatment groups were rated on the BPRS at an initial baseline measurement (week 0) and then weekly for eight weeks. This, too, is a longitudinal dataset that has been wrangled into long form. The data looks something like this:

##   treatment subject weeks bprs week
## 1         1       1 week0   42    0
## 2         1       2 week0   58    0
## 3         1       3 week0   54    0
## 4         1       4 week0   55    0
## 5         1       5 week0   72    0
## 6         1       6 week0   48    0
ggplot(BPRS, aes(x = week, y = bprs, linetype = subject)) + 
  geom_line() + 
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Week no.") +
  scale_y_continuous(name = "BPRS")+ 
  ggtitle("BPRS over time across 2 treatment groups") +
  theme_bw() + theme(legend.position = "none")

ggplot(BPRS, aes(x = week, y = bprs, group = subject, color = treatment)) +
  geom_text(aes(label = treatment), position = position_dodge(width = 0.3)) +
  theme_bw() + ggtitle("BPRS of individuals in treatment groups")

In the first treatment group, the BPRS decreases over time for most individuals, whereas in the second group the variation persists over the course of the testing.

Linear models

We shall fit a linear regression model for the data using the variables treatment and week as explanatory variables, and bprs as the target variable.
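A sketch of the call producing the summary below:

```r
# naive linear model: repeated measures treated as independent observations
BPRS_lm <- lm(bprs ~ week + treatment, data = BPRS)
summary(BPRS_lm)
```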

## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Based on this model, treatment is not a statistically significant explanatory variable. However, it should be noted that for longitudinal data a linear model is quite naive, even false, because it does not track individuals over time; it treats the repeated observations as independent of each other, which does not correspond to the reality of the situation. Thus, using the same variables as above, we also fit a random intercept model, which takes the longitudinal aspect of the data into account by allowing for heterogeneity in intercepts across subjects.
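A sketch of the random intercept fit with lme4 (`REML = FALSE` matches the maximum-likelihood fit shown below; the model name `BPRS_ref` is taken from the later ANOVA output):

```r
library(lme4)
# (1 | subject): a random intercept for each subject
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject),
                 data = BPRS, REML = FALSE)
summary(BPRS_ref)
```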

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Below is a random slope model for the data, which additionally allows for heterogeneity in slopes, followed by an analysis of variance (ANOVA) test comparing the above random intercept model with the random slope model.
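A sketch of this step (model names as in the ANOVA output below; `BPRS_ref` refers to the random intercept model above):

```r
# (week | subject): random intercept and random slope of week per subject
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject),
                  data = BPRS, REML = FALSE)
summary(BPRS_ref1)

# likelihood-ratio comparison against the random intercept model
anova(BPRS_ref, BPRS_ref1)
```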

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8202  8.0511        
##           week         0.9608  0.9802   -0.51
##  Residual             97.4307  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
## 
## 
## ANOVA test
## Data: BPRS
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref   5 2748.7 2768.1 -1369.4   2738.7                           
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA result (p < 0.05) indicates that the models are statistically significantly different from each other, i.e. the random slope model fits the data better. We shall thus fit a random intercept and random slope model in which week and treatment are no longer kept separate: we allow for a week × treatment interaction. We conduct ANOVA tests to compare this with the above models.
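A sketch of the interaction model and its comparisons (model names as in the output below; `BPRS_ref` and `BPRS_ref1` refer to the models fitted above):

```r
# week * treatment expands to week + treatment + week:treatment
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject),
                  data = BPRS, REML = FALSE)
summary(BPRS_ref2)

anova(BPRS_ref, BPRS_ref2)   # vs. the random intercept model
anova(BPRS_ref1, BPRS_ref2)  # vs. the random slope model
```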

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0767  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 65.0016  8.0624        
##           week         0.9688  0.9843   -0.51
##  Residual             96.4699  9.8219        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2522  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
## 
## 
## ANOVA test with random intercept model
## Data: BPRS
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref   5 2748.7 2768.1 -1369.4   2738.7                           
## BPRS_ref2  8 2744.3 2775.4 -1364.1   2728.3 10.443      3    0.01515 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ANOVA test with random slope model
## Data: BPRS
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4                           
## BPRS_ref2  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Evidently the difference from the random intercept model is statistically significant, but the difference from the random slope model less so (though still p < 0.1).

Finally, we shall draw a plot that shows the fitted values of the random intercept and slope model.

The difference is minimal even in the visualisation, and indeed Vehkalahti & Everitt (2019: Chapter 8) confirm that the difference between the two treatment groups is not statistically significant.